5  Hypothesis testing

Author

Vladimir Buskin

5.1 Preparation

  • Load packages:
library("readxl")
library("tidyverse")
  • Load the data sets:
data <- read_xlsx("Paquot_Larsson_2020_data.xlsx")

data_vowels <- read.csv("Vowels_Apache.csv", sep = "\t")

5.2 Hypothesis testing

The first step is to define the null hypothesis \(H_0\) and the alternative hypothesis \(H_1\) (or \(H_a\)).

Given two categorical variables \(X\) and \(Y\), we assume under \(H_0\) that both variables are independent of each other. This hypothesis describes the “default state of the world” (James et al. 2021: 555), i.e., what we would usually expect to see. By contrast, the alternative hypothesis \(H_1\) states that \(X\) and \(Y\) are not independent, i.e., that \(H_0\) does not hold.

In this unit, we will consider two scenarios:

  1. We are interested in finding out whether English clause ORDER (‘sc-mc’ or ‘mc-sc’) depends on the type of the subordinate clause (SUBORDTYPE), which can be either temporal (‘temp’) or causal (‘caus’).

Our hypotheses are:

  • \(H_0:\) The variables ORDER and SUBORDTYPE are independent.

  • \(H_1:\) The variables ORDER and SUBORDTYPE are not independent.

  2. As part of a phonetic study, we compare the base frequencies of the F1 formants for male and female speakers of Apache. We formulate the following hypotheses:
  • \(H_0:\) mean F1 frequency of men \(=\) mean F1 frequency of women.

  • \(H_1:\) mean F1 frequency of men \(\ne\) mean F1 frequency of women.

Based on our data, we decide either to reject \(H_0\) or not to reject it. Rejecting \(H_0\) can be viewed as evidence in favour of \(H_1\) and thus marks a potential ‘discovery’ in the data. However, there is always a chance that our decision is wrong; the four possible constellations are summarised in the table below (cf. Heumann, Schomaker, and Shalabh 2022: 223):

                          \(H_0\) is true                               \(H_0\) is not true
\(H_0\) is not rejected   \(\color{green}{\text{Correct decision}}\)    \(\color{red}{\text{Type II } (\beta)\text{-error}}\)
\(H_0\) is rejected       \(\color{red}{\text{Type I } (\alpha)\text{-error}}\)   \(\color{green}{\text{Correct decision}}\)

The probability of a Type I error, i.e., of rejecting \(H_0\) although it is true, is called the significance level \(\alpha\); its conventional value is \(0.05\) (i.e., a 5% chance of committing a Type I error).
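To make \(\alpha\) concrete, here is a minimal simulation sketch (anticipating the \(\chi^2\)-test introduced below): we draw many samples of two unrelated binary variables, so \(H_0\) is true by construction, and check how often the test nevertheless rejects it. The variable names and sample size (100) are invented for illustration.

set.seed(42) # for reproducibility

# 10,000 samples of two unrelated binary variables (H0 true by construction)
p_values <- replicate(10000, {
  x <- sample(c("a", "b"), 100, replace = TRUE)
  y <- sample(c("c", "d"), 100, replace = TRUE)
  chisq.test(table(x, y), correct = FALSE)$p.value
})

# Proportion of (incorrect) rejections: should be close to alpha = 0.05
mean(p_values < 0.05)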

5.2.1 The \(\chi^2\)-test

The next step is to compute a test statistic that indicates how strongly our data deviate from \(H_0\). For instance, the Pearson \(\chi^2\) statistic is commonly used for categorical variables. It requires two types of values: the observed frequencies \(n_{ij}\) in our data set and the expected frequencies \(m_{ij}\), which we would expect to see if \(H_0\) were true.

This table represents a generic contingency table where \(X\) and \(Y\) are categorical variables. Each \(x_i\) represents a category of \(X\) and each \(y_j\) represents a category of \(Y\). Each cell contains the count \(n_{ij}\) of observations in the \(i\)-th row and \(j\)-th column.

                        \(Y\)
              \(y_1\)       \(y_2\)       \(\cdots\)    \(y_J\)
\(X\)   \(x_1\)     \(n_{11}\)    \(n_{12}\)    \(\cdots\)    \(n_{1J}\)
        \(x_2\)     \(n_{21}\)    \(n_{22}\)    \(\cdots\)    \(n_{2J}\)
        \(\vdots\)  \(\vdots\)    \(\vdots\)    \(\ddots\)    \(\vdots\)
        \(x_I\)     \(n_{I1}\)    \(n_{I2}\)    \(\cdots\)    \(n_{IJ}\)
# Cross-tabulate the frequencies for the variables of interest

freqs <- table(data$ORDER, data$SUBORDTYPE); freqs
       
        caus temp
  mc-sc  184   91
  sc-mc   15  113

We calculate the expected frequencies by using the formula

\[ m_{ij} = \frac{i\textrm{th row sum} \times j \textrm{th column sum}}{\textrm{number of observations}}. \]

# Compute expected frequencies

## Calculate row totals
row_totals <- rowSums(freqs)

## Calculate column totals
col_totals <- colSums(freqs)

## Total number of observations
total_obs <- sum(freqs)

## Calculate expected frequencies
expected_table <- outer(row_totals, col_totals) / total_obs

expected_table
           caus      temp
mc-sc 135.79404 139.20596
sc-mc  63.20596  64.79404
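A quick sanity check: by construction, the expected frequencies reproduce the row and column totals of the observed table.

# Sanity check: expected frequencies preserve the observed margins
rowSums(expected_table) # identical to row_totals
colSums(expected_table) # identical to col_totals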

Given a sample with \(n\) observations, the \(\chi^2\)-statistic measures how much the observed frequencies deviate from the expected frequencies for each cell in a contingency table; for an \(I \times J\) table, the test has \(df = (I-1)(J-1)\) degrees of freedom (cf. Heumann, Schomaker, and Shalabh 2022: 249-251):

\[ \chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}{\frac{(n_{ij} - m_{ij})^2}{m_{ij}}}. \] The test stipulates that …

  1. all observations are independent of each other,
  2. at least 80% of the expected frequencies are \(\geq\) 5, and
  3. all observed frequencies are \(\geq\) 1.
# Compute chi-squared scores for all cells
## Create empty chi_squared_table for later storage
chi_squared_table <- matrix(NA, nrow = 2, ncol = 2,
                            dimnames = list(c("mc-sc", "sc-mc"), c("caus", "temp")))

# Loop: Repeat the following commands ...
for (i in 1:2) { # for each of the 2 rows and
  for (j in 1:2) { # for each of the 2 columns:
    observed_freq <- freqs[i, j] # 1. Get the observed frequencies
    expected_freq <- expected_table[i, j] # 2. Get the expected frequencies
    chi_squared_score <- ((observed_freq - expected_freq)^2) / expected_freq # 3. Compute chi-squared scores
    chi_squared_table[i, j] <- chi_squared_score # 4. Store output in the chi_squared_table
  }
}

chi_squared_table
          caus     temp
mc-sc 17.11278 16.69335
sc-mc 36.76575 35.86463
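Summing the four cell scores yields the \(\chi^2\)-statistic itself (without continuity correction):

# The chi-squared statistic is the sum of all per-cell scores
sum(chi_squared_table) # approx. 106.44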

Or, more elegantly:

freqs_test <- chisq.test(freqs)

# Expected frequencies
freqs_test$expected
       
             caus      temp
  mc-sc 135.79404 139.20596
  sc-mc  63.20596  64.79404
# Chi-squared scores
(freqs_test$residuals)^2
       
            caus     temp
  mc-sc 17.11278 16.69335
  sc-mc 36.76575 35.86463
# Test statistics
freqs_test

    Pearson's Chi-squared test with Yates' continuity correction

data:  freqs
X-squared = 104.24, df = 1, p-value < 2.2e-16
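Note that chisq.test() applies Yates' continuity correction to \(2 \times 2\) tables by default, which is why the reported X-squared = 104.24 is slightly lower than the sum of our by-hand scores (≈ 106.44). The correction can be switched off:

# Disable Yates' continuity correction to reproduce the by-hand result
chisq.test(freqs, correct = FALSE)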
Important

If the data does not meet the (expected) frequency requirements for the \(\chi^2\)-test, Fisher's exact test is a viable alternative (see ?fisher.test for details).
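For illustration, a minimal call applies the test directly to the contingency table from above:

# Fisher's exact test on the observed frequency table
fisher.test(freqs)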

5.2.2 The \(t\)-test

Since the \(\chi^2\) measure works exclusively with categorical variables, a separate test statistic is required if one of the variables is continuous. The \(t\)-statistic is often used for research questions involving differences between sample means. The way \(t\) is calculated depends on how \(X\) and \(Y\) were sampled: do they come from two independent samples or from dependent (paired) ones?

First, we consider two independent samples from a population:

  • Sample \(X\) with the observations \(x_1, x_2, \dots, x_{n_1}\), sample size \(n_1\), sample mean \(\bar{x}\) and sample variance \(s^2_x\).

  • Sample \(Y\) with the observations \(y_1, y_2, \dots, y_{n_2}\), sample size \(n_2\), sample mean \(\bar{y}\) and sample variance \(s^2_y\).

The \(t\)-statistic after Welch is given by:

\[ t(x, y) = \frac{|\bar{x} - \bar{y}|}{\sqrt{\frac{s^2_x}{n_1} + \frac{s^2_y}{n_2}}} \]

  • If there is more than one observation for a given subject (e.g., before and after an experiment), the samples are called dependent or paired. The paired \(t\)-test assumes two continuous variables \(X\) and \(Y\).

  • In the paired test, the variable \(d\) denotes the difference between them, i.e., \(d_i = x_i - y_i\). The corresponding test statistic is obtained via

\[ t(x, y) = t(d) = \frac{\bar{d}}{s_d} \sqrt{n}. \]

Here \(\bar{d} = \frac{1}{n}\sum_{i=1}^n{d_i}\) denotes the mean difference, and the variance is

\[ s^2_d = \frac{\sum_{i=1}^n({d_i} - \bar{d})^2}{n-1}. \]
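A minimal sketch of a paired design, using hypothetical before/after measurements for ten subjects (the data are invented for illustration):

# Hypothetical paired data: one before and one after value per subject
set.seed(1)
before <- rnorm(10, mean = 500, sd = 50)
after  <- before - rnorm(10, mean = 20, sd = 30)

# By hand: t(d) = (mean(d) / sd(d)) * sqrt(n)
d <- before - after
t_paired <- (mean(d) / sd(d)) * sqrt(length(d))

# Equivalent built-in call
t.test(before, after, paired = TRUE)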

Traditionally, the \(t\)-test is based on the assumptions of …

  1. Normality and
  2. Variance homogeneity (i.e., equal sample variances). Note that this does not apply to the \(t\)-test after Welch, which can handle unequal variances.

By hand:

# Subset the data by sex
data_m <- data_vowels[data_vowels$SEX == "M", ]
data_f <- data_vowels[data_vowels$SEX == "F", ]

# Compute sample means
mean_m <- mean(data_m$HZ_F1)
mean_f <- mean(data_f$HZ_F1)

# Compute sample variances
var_m <- var(data_m$HZ_F1)
var_f <- var(data_f$HZ_F1)

# Determine sample sizes
n_m <- length(data_m$HZ_F1)
n_f <- length(data_f$HZ_F1)

# Compute t-statistic
t_statistic <- abs(mean_m - mean_f) / sqrt((var_m / n_m) + (var_f / n_f))

# Compute degrees of freedom (simple pooled version; cf. the Welch sketch below)
df <- n_m + n_f - 2

# Find the p-value using the cumulative distribution function (CDF) of the t-distribution
p_value <- 2 * pt(-t_statistic, df)
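The pooled \(df\) above is a deliberate simplification; t.test() instead uses the Welch–Satterthwaite approximation, a sketch of which reproduces the \(df = 112.19\) reported below:

# Welch–Satterthwaite approximation of the degrees of freedom
df_welch <- (var_m / n_m + var_f / n_f)^2 /
  ((var_m / n_m)^2 / (n_m - 1) + (var_f / n_f)^2 / (n_f - 1))

# p-value based on the Welch degrees of freedom
p_value_welch <- 2 * pt(-t_statistic, df_welch)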

Or, more concisely:

t.test(data_vowels$HZ_F1 ~ data_vowels$SEX, paired = FALSE) # there is a significant difference!

    Welch Two Sample t-test

data:  data_vowels$HZ_F1 by data_vowels$SEX
t = 2.4416, df = 112.19, p-value = 0.01619
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
  8.403651 80.758016
sample estimates:
mean in group F mean in group M 
       528.8548        484.2740 
Important

If at least one assumption of the \(t\)-test has been violated, it is important to use a non-parametric test such as the Wilcoxon-Mann-Whitney (WMW) U-test instead. In essence, this test estimates the probability that a value \(x\) drawn from sample \(X\) is greater than a value \(y\) drawn from sample \(Y\). For details, see ?wilcox.test.
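Applied to the vowel data, a minimal call looks like this:

# WMW test for F1 frequencies by sex (non-parametric alternative)
wilcox.test(HZ_F1 ~ SEX, data = data_vowels)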

5.2.3 Constructing the critical region

An important question remains: How large must the difference be for us to reject \(H_0\)? The \(p\)-value measures the probability of obtaining a test statistic at least as extreme as the one observed, assuming that \(H_0\) holds. For example, a \(p\)-value of \(0.02\) means that we would see a \(\chi^2\)-score (or \(t\)-, \(F\)-score etc.) of this size or larger only 2% of the time if \(X\) and \(Y\) were unrelated (or if there were no difference between \(\bar{x}\) and \(\bar{y}\), respectively). Since our significance level \(\alpha\) is set to \(0.05\), we reject the null hypothesis only if this probability falls below 5%.
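In R, such \(p\)-values can be computed from the cumulative distribution functions of the reference distributions; a short sketch using the test statistics obtained above:

# p-value for a chi-squared score of 106.44 with df = 1
pchisq(106.44, df = 1, lower.tail = FALSE)

# Two-sided p-value for t = 2.4416 with df = 112.19
2 * pt(2.4416, df = 112.19, lower.tail = FALSE)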

These \(p\)-values derive from the distributions of the test statistics under \(H_0\), whose probability density functions we can visualise:

  • Probability density function for the \(\chi^2\)-distribution with \(df = 1\)
Code
# Generate random samples from a chi-squared distribution with 1 degree of freedom
x <- rchisq(100000, df = 1)

# Create histogram
hist(x,
     breaks = "Scott",
     freq = FALSE,
     xlim = c(0, 20),
     ylim = c(0, 0.2),
     ylab = "Probability density of observing a specific score",
     xlab = "Chi-squared score",
     main = "Histogram for a chi-squared distribution with 1 degree of freedom (df)",
     cex.main = 0.9)

# Overlay PDF
curve(dchisq(x, df = 1), from = 0, to = 150, n = 5000, col = "orange", lwd = 2, add = TRUE)

  • Probability density function for the \(t\)-distribution with \(df = 112.19\)
Code
# Given t-statistic and degrees of freedom
t_statistic <- 2.4416
df <- 112.19

# Generate random samples from a t-distribution with the given degrees of freedom
x <- rt(100000, df = df)

# Create histogram
hist(x,
     breaks = "Scott",
     freq = FALSE,
     xlim = c(-5, 5),
     ylim = c(0, 0.4),
     ylab = "Probability density of observing a specific score",
     xlab = "t-score",
     main = "Histogram for a t-distribution with 112.19 degrees of freedom",
     cex.main = 0.9)

# Overlay PDF
curve(dt(x, df = df), from = -5, to = 5, n = 5000, col = "orange", lwd = 2, add = TRUE)

5.3 Workflow in R

5.3.1 \(\chi^2\)-test

5.3.1.1 Define hypotheses

  • \(H_0:\) The variables ORDER and SUBORDTYPE are independent.

  • \(H_1:\) The variables ORDER and SUBORDTYPE are not independent.

5.3.1.2 Descriptive overview

We plot the distribution of clause ORDER depending on SUBORDTYPE. This requires (a) selecting the desired variables, (b) computing the token frequencies and (c) computing the percentages.

Code
# Filter data so as to show only those observations that are relevant

data %>% 
  # Filter columns
  select(ORDER, SUBORDTYPE) %>%
    # Count observations 
    count(ORDER, SUBORDTYPE) %>%  
    # Compute percentages
    mutate(pct = n/sum(n) * 100) -> data_order_subord

knitr::kable(data_order_subord)
ORDER   SUBORDTYPE     n         pct
mc-sc   caus         184   45.657568
mc-sc   temp          91   22.580645
sc-mc   caus          15    3.722084
sc-mc   temp         113   28.039702

Next, we visualise the ORDER distribution using a barplot with a custom y-axis, requiring geom_col().

Code
# Plot distribution

data_order_subord %>%
  # Map variables onto axes
  ggplot(aes(x = ORDER, y = pct, fill = SUBORDTYPE)) +
    # Define plot type
    geom_col(position = "dodge") +
    # Define theme
    theme_classic()

5.3.1.3 Running the test

# Cross-tabulate the frequencies for the variables of interest

freqs <- table(data$ORDER, data$SUBORDTYPE)

freqs ## Assumption met: all observed freqs >= 1
       
        caus temp
  mc-sc  184   91
  sc-mc   15  113
# Run a chi-squared test on the absolute frequencies and print the results

test <- chisq.test(freqs, correct = FALSE)

# Inspect expected frequencies

test$expected # Assumption met: all expected frequencies >= 5
       
             caus      temp
  mc-sc 135.79404 139.20596
  sc-mc  63.20596  64.79404

5.3.1.4 Optional: Effect size

The sample-size-independent effect size measure Cramér's V (equal to \(\phi\) for \(2 \times 2\) tables) is defined as

\[V = \sqrt{\frac{\chi^2}{N \times df^*}},\] where \(df^* = \min(I-1, J-1)\). The outcome varies between \(0\) (= no correlation) and \(1\) (= perfect correlation); cf. also Gries (2013: 186).

Code
# Compute Cramer's V

## By hand:

# Given chi-squared statistic
chi_squared <- unname(test$statistic)

# Total number of observations
total_obs <- sum(freqs)

sqrt(chi_squared / (total_obs * (min(dim(freqs)) - 1)))
[1] 0.5139168
Code
## Automatically:
library("confintr") # Load library

cramersv(test)
[1] 0.5139168

5.3.1.5 Reporting the results

According to a \(\chi^2\)-test, there is a significant association between clause ORDER and SUBORDTYPE at \(p < 0.001\) (\(\chi^2 = 106.44\), \(df = 1\)), thus justifying the rejection of \(H_0\).

5.3.2 \(t\)-test

5.3.2.1 Define hypotheses

  • \(H_0:\) mean F1 frequency of men \(=\) mean F1 frequency of women.

  • \(H_1:\) mean F1 frequency of men \(\ne\) mean F1 frequency of women.

5.3.2.2 Descriptive overview

We select the variables of interest and proceed to calculate the mean F1 frequencies for each level of SEX, which requires a grouped data frame.

Code
# Filter data so as to show only those observations that are relevant

data_vowels %>% 
  # Filter columns
  select(HZ_F1, SEX) %>%
    # Define grouping variable
    group_by(SEX) %>% 
      # Compute mean and standard deviation for each sex
      summarise(mean = mean(HZ_F1),
                sd = sd(HZ_F1)) -> data_vowels_stats

knitr::kable(data_vowels_stats)
SEX       mean         sd
F     528.8548  110.80099
M     484.2740   87.90112
Code
# Plot distribution

## Plot means

data_vowels_stats %>% 
  ggplot(aes(x = SEX, y = mean)) +
    geom_col() +
    geom_errorbar(aes(x = SEX,
                    ymin = mean-sd,
                    ymax = mean+sd), width = .2) +
    theme_classic()

Code
## Plot quartiles
data_vowels %>% 
  ggplot(aes(x = SEX, y = HZ_F1)) +
    geom_boxplot() +
    theme_classic()

5.3.2.3 Check \(t\)-test assumptions

# Normality

shapiro.test(data_vowels$HZ_F1) # H0: data points follow the normal distribution

    Shapiro-Wilk normality test

data:  data_vowels$HZ_F1
W = 0.98996, p-value = 0.5311
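Strictly speaking, normality should hold within each group; the same test can be applied to each sex separately:

# Normality within each group
shapiro.test(data_vowels$HZ_F1[data_vowels$SEX == "M"])
shapiro.test(data_vowels$HZ_F1[data_vowels$SEX == "F"])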
# Check histogram

ggplot(data_vowels, aes(x = HZ_F1)) +
  geom_histogram(bins = 30) +
  theme_classic()

# Variance homogeneity

var.test(data_vowels$HZ_F1 ~ data_vowels$SEX) # H0: the variances are equal

    F test to compare two variances

data:  data_vowels$HZ_F1 by data_vowels$SEX
F = 1.5889, num df = 59, denom df = 59, p-value = 0.07789
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.949093 2.660040
sample estimates:
ratio of variances 
          1.588907 

5.3.2.4 Running the test

# t-test for two independent samples 

t.test(data_vowels$HZ_F1 ~ data_vowels$SEX, paired = FALSE) # there is a significant difference!

    Welch Two Sample t-test

data:  data_vowels$HZ_F1 by data_vowels$SEX
t = 2.4416, df = 112.19, p-value = 0.01619
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
  8.403651 80.758016
sample estimates:
mean in group F mean in group M 
       528.8548        484.2740 

5.3.2.5 Optional: Effect size

Cohen’s d is a possible effect size measure for continuous data and is obtained by dividing the difference of both sample means by the pooled standard deviation:

\[\frac{\bar{x} - \bar{y}}{\sqrt{\frac{{(n_1 - 1)s_x^2 + (n_2 - 1)s_y^2}}{{n_1 + n_2 - 2}}}}.\]

Code
library("effsize")

# By hand:
## Compute pooled standard deviation sp
sp <- sqrt(((n_m - 1) * var_m + (n_f - 1) * var_f) / (n_m + n_f - 2))

## Compute Cohen's d
d <- abs(mean_m - mean_f) / sp

# Automatically:
cohen.d(data_vowels$HZ_F1, data_vowels$SEX) # see also ?cohen.d for more details

Cohen's d

d estimate: 0.4457697 (small)
95 percent confidence interval:
     lower      upper 
0.07976048 0.81177897 

5.3.2.6 Reporting the results

According to a two-sample \(t\)-test, there is a significant difference between the mean F1 frequencies of male and female speakers of Apache (\(t = 2.44\), \(df = 112.19\), \(p < 0.05\)). Therefore, \(H_0\) will be rejected.